
Data cleaning

The first thing we need to do is check that our data has been entered accurately. We do this by checking that the values are believable.

A good starting point is to look at the maximum and minimum in each column.

Leave a blank row below the last row of the database and then enter:

=max(a2:a31)

=min(a2:a31)

These formulas return the largest and the smallest number in the column.

Do this for all the columns (you can copy and paste the formula across the database). It should look like this:

Username  Gender  Age at first symptoms  Age  Disease duration  Group  Baseline wellbeing  End of study wellbeing
Max       1       40                     64   42                1      63                  71.2
Min       0       14                     20   0                 0      1                   4
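
For readers who keep their data outside a spreadsheet, the same maximum and minimum check can be sketched in Python. This is only an illustration: the rows are made-up values rather than the course database, and the function name `column_ranges` is hypothetical.

```python
# Illustrative rows only; these are NOT the values from the course database.
rows = [
    {"Gender": 1, "Age at first symptoms": 25, "Age": 40, "Disease duration": 15},
    {"Gender": 0, "Age at first symptoms": 14, "Age": 20, "Disease duration": 6},
    {"Gender": 1, "Age at first symptoms": 40, "Age": 64, "Disease duration": 24},
]


def column_ranges(rows):
    """Return {column: (minimum, maximum)} for every column in the rows."""
    columns = rows[0].keys()
    return {
        col: (min(r[col] for r in rows), max(r[col] for r in rows))
        for col in columns
    }


ranges = column_ranges(rows)
for col, (lo, hi) in ranges.items():
    print(f"{col}: min={lo}, max={hi}")
```

Each printed line plays the same role as one column of the Max and Min rows in the spreadsheet.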


If the data has been entered correctly, these values should make sense. Going through the columns:

Gender - we don't have anyone who is not a 1 or a 0, so that is fine.

Age at first symptoms - we don't have anyone who had symptoms at birth or at age 140, so that looks fine.

Age - these are within age ranges that could be feasible.

Disease duration - someone could have had a disease duration of 42 years if they got the condition at age 14, but this seems a bit long, so we should check it. Someone who has only just developed the disease could have a disease duration of 0 years, but we should check this too.

Wellbeing - the score is supposed to be between 0 and 100, so this is fine.
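
The column-by-column reasoning above can also be written as an automatic range check. The sketch below is a hedged illustration in Python: the 0 to 1 limits for gender and the 0 to 100 limits for wellbeing come from the text, while the 0 to 120 age limit, the sample rows, and the name `out_of_range` are assumptions made for the example.

```python
# Plausible limits per column. Gender and wellbeing limits follow the text;
# the age limit of 0 to 120 is an assumption, so adjust it to your own study.
LIMITS = {
    "Gender": (0, 1),
    "Age": (0, 120),
    "Baseline wellbeing": (0, 100),
    "End of study wellbeing": (0, 100),
}


def out_of_range(rows, limits):
    """Return (username, column, value) for every value outside its limits."""
    problems = []
    for row in rows:
        for col, (lo, hi) in limits.items():
            if not lo <= row[col] <= hi:
                problems.append((row["Username"], col, row[col]))
    return problems


# Illustrative rows only; these are NOT the values from the course database.
sample = [
    {"Username": 1, "Gender": 1, "Age": 40,
     "Baseline wellbeing": 63, "End of study wellbeing": 71.2},
    {"Username": 2, "Gender": 2, "Age": 35,
     "Baseline wellbeing": 50, "End of study wellbeing": 140},
]

flagged = out_of_range(sample, LIMITS)
print(flagged)
```

A check like this only catches impossible values. A value that is within range but still wrong, such as a mistyped age, needs a cross-check against other columns.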

1. After checking the disease duration (age - age at first symptoms = disease duration), was there any problem with the database?

a)
b)
Yes, there is a problem with the database: the participant with user name 6 is not entered correctly.
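
The disease duration check used in this question (age - age at first symptoms = disease duration) can be sketched in Python as follows. The rows are invented for illustration and are not the course database, and the function name `duration_mismatches` is hypothetical.

```python
def duration_mismatches(rows):
    """Return usernames whose recorded disease duration differs from
    age minus age at first symptoms."""
    bad = []
    for row in rows:
        expected = row["Age"] - row["Age at first symptoms"]
        if row["Disease duration"] != expected:
            bad.append(row["Username"])
    return bad


# Illustrative rows only; these are NOT the values from the course database.
sample = [
    {"Username": "A", "Age at first symptoms": 30, "Age": 45, "Disease duration": 15},
    {"Username": "B", "Age at first symptoms": 14, "Age": 20, "Disease duration": 16},
]

mismatched = duration_mismatches(sample)
print(mismatched)  # ['B'], because 20 - 14 is 6, not 16
```

A mismatch does not say which of the three values is wrong, only that the row needs to be checked against the original records.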